Chicago PD: Predicting & Inferring Police Misconduct

Team name

Author

Alan Senkus, Ryan Payne, Alex Ortega, Willem Luyten

Published

March 15, 2023

Abstract
Domestic violence, substance abuse, and race all play a role in whether a police officer is disciplined for a sustained allegation, and in whether the allegation is sustained in the first place. However, none of these are strong enough predictors to identify which officers CPD should focus its internal affairs resources on.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import itertools
import time
import warnings

0.1 Data quality check / cleaning / preparation


0.1.1 Data quality check

By Ryan Payne

The code below visualizes the distribution of all the variables in the dataset, and their association with the response.

#...Distribution of continuous variables...#
#...Distribution of categorical variables...#
#...Association of the response with the predictors...#

0.1.2 Data cleaning

By Ryan Payne

From the data quality check we realized that:

  1. The data did not have a disciplined response column, so we had to construct it from the observations in “final_outcome”.
  2. The data came in separate csv files, for example files containing roster information, regional demographics, officer salaries, victim information, and more. We needed to decide which of these csv files to use. Based on missing-data criteria, we decided to focus only on the data pertaining to the officers and not the victims. These datasets needed to be merged with one another on the officer’s unique ID number.
  3. The final_outcome column had double entries, once in lower case and once in upper case, that needed to be combined.
  4. In general, there were many formatting issues that needed to be resolved.
  5. There were categories with very few (less than 0.01%) instances that needed to be either removed or consolidated. For example, we consolidated all gang-related units in unit_description into one unit category.

The data came in three separate csv files that needed to be merged and cleaned before we could begin our analysis. These three files contained information on (1) the accused police officers (their unit, age, etc.), (2) salary information, and (3) other roster information. The cells below walk through how these three files were combined and cleaned to create our final dataset.

0.2 Accused Data

# ACCUSED DATA 
accused_df = pd.read_csv('./project_data_unified/complaints/complaints-accused.csv')

# Remove rows for which final_outcome = NaN
accused_df = accused_df.dropna(subset=['final_outcome'])
# Remove rows for which final_outcome = "Unknown" (pretty much same thing as NaN)
accused_df = accused_df.loc[accused_df['final_outcome'] != 'Unknown']
# Rename complaint_category as complaint_description, per reference data
accused_df = accused_df.rename(columns={'complaint_category':'complaint_descr'})
# Drop recommended finding and outcome because (a) >50% NaN and (b) mostly redundant
accused_df = accused_df.drop(columns={'recc_finding', 'recc_outcome'})


# Combine like values (case-normalize duplicate categories)
accused_df['final_outcome'] = accused_df['final_outcome'].astype(str)
outcome_map = {
    'NO ACTION TAKEN': 'No Action Taken',
    '**PENALTY NOT SERVED': 'Penalty Not Served',
    'REPRIMAND': 'Reprimand',
    'SEPARATION': 'Separation',
    'REINSTATED COURT ACT': 'Reinstated by Court Action',
    'RESIGNED -NOT SERVED': 'Resigned',
    'REINSTATED POLICE BD': 'Reinstated by Police Board',
}
for old, new in outcome_map.items():
    accused_df['final_outcome'] = accused_df['final_outcome'].str.replace(old, new, regex=False)


# treat '15 Day Suspension' same as '15 DAY SUSPENSION', for each #-day combination 
import re 
accused_df['final_outcome'] = accused_df['final_outcome'].astype(str)

def standardize_case(s):  
    # Check if the string contains a number
    if any(char.isdigit() for char in s):
        # Extract the numeric part of the string using regular expressions
        num_str = re.findall(r'\d+', s)[0]
        # Use regular expressions to replace the text part with a lowercased version
        text = re.sub(r'[a-zA-Z]+', lambda m: m.group(0).lower(), s.replace(num_str, ''))
        # Combine the standardized text and numeric parts of the string
        return num_str + ' ' + text.strip()
    else:
        return s

# Update final_outcome 
accused_df['final_outcome'] = accused_df['final_outcome'].apply(standardize_case)

Next, we needed to create our response variable (disciplined = 1 or disciplined = 0) from the categories in “final_outcome.” We decided not to treat “Reprimand” as a form of discipline, since it is often used as a token punishment. Importantly for the data, however, reprimands make up the majority of disciplinary actions. Their removal therefore reduces the number of positive responses (disciplined = 1) quite substantially.

# Filter out all the final_outcomes that DON'T count as discpline
disc_outcomes = accused_df['final_outcome'].value_counts().index.difference(['No Action Taken', 'Reprimand', 'RESIGNED -NOT SERVED', 
                                                              'SUSTAINED-NO PENALTY', 'Violation Noted', 'Penalty Not Served', 
                                                                'Reinstated by Police Board', 'Resigned'])

# Create the binary, discplined or not, outcome variable 
accused_df['disciplined'] = accused_df['final_outcome'].apply(lambda x: 1 if x in disc_outcomes else 0)

Just as with “disciplined”, we also needed to create our second response variable, sustained (sustained = 1 or sustained = 0), from the categories in “final_finding”.

# Create sustained or not sustained binary (final_find == NS)
accused_df['final_finding'].value_counts()
accused_df['sustained'] = accused_df['final_finding'].apply(lambda x: 1 if x == 'SU' else 0)

# Keep in mind, final_finding has other categories 

The accused_df contains a column “complaint_code” which corresponds to a complaint category, such as false arrest or unfit for duty. These description-code pairings were in another file that needed to be merged with accused_df on “complaint_code”.

# Complaint categories 
complaint_ids = pd.read_excel('./project_data_unified/Complaint_Categories.xlsx')
complaint_ids = complaint_ids.rename(columns={111:'complaint_code', 'CATEGORY':'complaint_category', 
                                             'ON / OFF DUTY':'on_off_duty'})
# Collapse Excessive use of force into use of force
complaint_ids['complaint_category'] = complaint_ids['complaint_category'].apply(lambda x: x.replace('Excessive Force', 'Use of Force'))
# Combine racial profiling and first amendment into 'Other', since so few n
complaint_ids['complaint_category'] = complaint_ids['complaint_category'].apply(lambda x: x.replace('Racial Profiling', 'Other'))
complaint_ids['complaint_category'] = complaint_ids['complaint_category'].apply(lambda x: x.replace('First Amendment', 'Other'))

# Add ON/OFF duty and complaint category to accused_df from reference data
accused_df = pd.merge(accused_df, complaint_ids.loc[:,['complaint_code', 'on_off_duty', 'complaint_category']])

1 Roster

Most of the cleaning performed below consists of renaming, replacing, and consolidating certain values to (a) make the data interpretable both to us and to statsmodels’ functions and (b) remove as much imbalance in particular categories as appropriate.

# Police roster information

roster_df = pd.read_csv('./project_data_unified/roster/roster_1936-2017_2017-04.csv')
star_nums = ['star1', 'star2', 'star3', 'star4', 'star5', 'star6', 'star7', 'star8', 'star9', 'star10']
to_drop = ['last_name_NS','first_name_NS', 'middle_initial', 'middle_initial2', 'suffix_name','roster_1936-2017_2017-04_ID',
          'row_id', 'merge', 'birth_year', 'resignation_date', 'link_UID']+star_nums
# Drop columns
roster_df = roster_df.drop(columns=to_drop)

# Combine all police officers into one category
# Identify rows containing 'PO' and replace with 'POLICE OFFICER'
mask = roster_df['current_rank'].str.contains('PO') & ~roster_df['current_rank'].str.contains('POLICE OFFICER') & ~roster_df['current_rank'].str.contains('PO AS DETECTIVE') 
roster_df.loc[mask, 'current_rank'] = 'POLICE OFFICER'

# Manually rename remaining "PO" entries
roster_df['current_rank'] = roster_df['current_rank'].apply(lambda x: x.replace('P O ASSGN SEC SPEC', 'POLICE OFFICER'))
roster_df['current_rank'] = roster_df['current_rank'].apply(lambda x: x.replace('P.O. ASSIGNED AS HELICOPTER PILOT', 'POLICE OFFICER'))
roster_df['current_rank'] = roster_df['current_rank'].apply(lambda x: x.replace('DEP CHIEF', 'CHIEF'))

# Collapse the rest of the ranks into Other
to_replace = roster_df['current_rank'].value_counts().index.difference(['POLICE OFFICER', 'PO AS DETECTIVE', 'UNKNOWN', 'COMMANDER',
                                                      'DEP CHIEF', 'CHIEF'])

replace_dict = {level: 'Other' for level in to_replace}
roster_df['current_rank'] = roster_df['current_rank'].replace(replace_dict)

# Reduce number of unit descriptions 
counts = roster_df['unit_description'].value_counts()
roster_df.loc[roster_df['unit_description'].isin(counts[counts < 50].index), 'unit_description'] = 'Other'

# Rename columns 
roster_df = roster_df.rename(columns={'gender':'officer_gender', 'race':'officer_race', 'birth_year':'officer_birth_year', 
                                     'current_age':'officer_age'})

# Combine Native American and Pacific Islander, for balance reasons
roster_df['officer_race'] = roster_df['officer_race'].replace('NATIVE AMERICAN/ALASKAN NATIVE','ASIAN/PACIFIC ISLANDER')
roster_df['officer_race'] = roster_df['officer_race'].replace('ASIAN/PACIFIC ISLANDER', 'ASIA_PAC/NATIV_AM')

# Remove NaN and consolidate unknowns 
roster_df['unit_description'] = roster_df['unit_description'].fillna('Unknown')
roster_df['unit_description'] = roster_df['unit_description'].apply(lambda x: x.replace('UNKNOWN', 'Unknown'))

# Consolidate traffic units
mask = roster_df['unit_description'].str.contains('TRAFFIC')
roster_df.loc[mask, 'unit_description'] = 'TRAFFIC RELATED'

# Consolidating Airport units 
mask = roster_df['unit_description'].str.contains('AIRPORT')
roster_df.loc[mask, 'unit_description'] = 'AIRPORT RELATED'

# Consolidating Public housing units 
mask = roster_df['unit_description'].str.contains('PUBLIC HOUSING')
roster_df.loc[mask, 'unit_description'] = 'PUBLIC HOUSING'

# Consolidating YOUTH and juvenile units 
mask = (roster_df['unit_description'].str.contains('YOUTH') | roster_df['unit_description'].str.contains('JUVENILE'))
roster_df.loc[mask, 'unit_description'] = 'YOUTH_JUVENILE'

# Consolidate gang units 
mask = roster_df['unit_description'].str.contains('GANG') 
roster_df.loc[mask, 'unit_description'] = 'GANG ENFORCEMENT_INVESTIGATION'

# Consolidate patrol units 
mask = roster_df['unit_description'].str.contains('PATROL') 
roster_df.loc[mask, 'unit_description'] = 'PATROL'

# Consolidate forensic units 
mask = roster_df['unit_description'].str.contains('FORENSIC') 
roster_df.loc[mask, 'unit_description'] = 'FORENSIC'

# Consolidate property units 
mask = roster_df['unit_description'].str.contains('PROPERTY') |  roster_df['unit_description'].str.contains('ASSET')
roster_df.loc[mask, 'unit_description'] = 'PROPERTY RELATED'

# Consolidate minority units again, keep district 013 separate
counts = roster_df['unit_description'].value_counts()
roster_df.loc[roster_df['unit_description'].isin(counts[counts < 50].index), 'unit_description'] = 'Other'

2 Salary

# Salary data
salary_df = pd.read_csv('./project_data_unified/salary/salary_2002-2017_2017-09.csv')
salary_df['rank'] = salary_df['rank'].astype(str)

# Delete redundant columns 
salary_df = salary_df.filter(regex='^(?!salary-)')
salary_df = salary_df.drop(columns=['salary_2002-2017_2017-09_ID.1', 'salary_2002-2017_2017-09_ID', 'org_hire_date', 'year', 
                                   'spp_date', 'row_id'])

# Rename
salary_df['rank'] = salary_df['rank'].replace('POLICE OFFICER (ASSIGNED AS DETECTIVE)','PO AS DETECTIVE')

# Collapsing POLICE OFFICERS
mask = salary_df['rank'].str.contains('POLICE OFFICER') 
salary_df.loc[mask, 'rank'] = 'POLICE OFFICER'

# Collapsing other POLICE...
mask = salary_df['rank'].str.contains('POLICE') & ~salary_df['rank'].str.contains('POLICE OFFICER (ASSIGNED AS DETECTIVE)', regex=False)
salary_df.loc[mask, 'rank'] = 'POLICE OTHER'

# RENAME
salary_df = salary_df.rename(columns={'POLICE OFFICER (ASSIGNED AS DETECTIVE)':'PO AS DETECTIVE'})

to_replace = salary_df['rank'].value_counts().index.difference(['POLICE OFFICER', 'SERGEANT', 'COMMANDER', 'CHIEF', 'DEPUTY CHIEF',
                                                                'LIEUTENANT', 'EXPLOSIVES TECHNICIAN I', 'POLICE OTHER', 'PO AS DETECTIVE'])
replace_dict = {level: 'Other' for level in to_replace}
salary_df['rank'] = salary_df['rank'].replace(replace_dict)

2.1 Merge Dataframes

# Merge roster and salary data
roster_sal_df = pd.merge(salary_df, roster_df, on='UID')
# Merge officer roster-salary info with accused_df 
merged_df = pd.merge(accused_df, roster_sal_df, on='UID', how='inner')
# Drop duplicates
merged_df = merged_df.drop_duplicates(subset=['UID','cr_id', 'complaint_code']) 
#Drop redundant
merged_df = merged_df.drop(columns='link_UID_y')
# Rename
merged_df = merged_df.rename(columns={'link_UID_x':'link_UID'})

2.2 Final cleaning: dummy variables, filtering unusable variables, etc.

# Create copy of data frame 
df = merged_df.copy()

# Ignore regular expression warning messages
warnings.filterwarnings('ignore')

# DUMMIES::

# unit_description dummy df
unit_df = (pd.get_dummies(df['unit_description'])).astype(float)
unit_df = unit_df.drop(columns='Unknown')
# complaint_category dummy df
complaint_df = (pd.get_dummies(df['complaint_category'])).astype(float)
complaint_df = complaint_df.drop(columns='Other')
# officer gender dummy 
gender_df = pd.get_dummies(df['officer_gender'])
# drop female (only need 1 dummy to encode binary)
gender_df = gender_df.drop(columns='FEMALE')
# officer Race dummy 
race_df = pd.get_dummies(df['officer_race'])
# on/off duty dummy 
duty_df = pd.get_dummies(df['on_off_duty'])
# Drop OFF dummy (only need 1 to encode binary)
duty_df = duty_df.drop(columns='OFF')
# current rank of officer dummies 
df['current_rank'] = df['current_rank'].astype(str)
current_rank_df = pd.get_dummies(df['current_rank'])
current_rank_df.columns = current_rank_df.columns.str.replace(' ','_')
# drop UNKNOWN column (avoid perfect multicoll)
current_rank_df = current_rank_df.drop(columns='UNKNOWN')
# rename other column so we don't have 2 of them
current_rank_df = current_rank_df.rename(columns={'Other':'Other_rank'})

# Combine complaint_category + unit_description data frames 
complaint_unit_df = pd.concat([complaint_df, unit_df], axis=1)
# Combine gender and race 
gender_race_df = pd.concat([race_df, gender_df], axis=1)
# gender + race + unit + complaint category 
gender_race_unit_complaint_df = pd.concat([unit_df, complaint_df, gender_df, race_df], axis=1)
# Create full data frame 
full_df = pd.concat([gender_race_unit_complaint_df, duty_df, current_rank_df], axis=1)
# Add our 2 continuous variables
full_df['salary'] = df['salary']
full_df['officer_age'] = df['officer_age'] 
# Add disciplined
full_df['y'] = df['disciplined']

# Clean up column names (make compatible with smf formulas)
# regex=False treats characters like '(', ')', and '.' as literals, not patterns
full_df.columns = full_df.columns.str.replace(' ', '_', regex=False)
full_df.columns = full_df.columns.str.replace('/', '_', regex=False)
full_df.columns = full_df.columns.str.replace(')', '', regex=False)
full_df.columns = full_df.columns.str.replace('(', '_', regex=False)
full_df.columns = full_df.columns.str.replace('-', '_', regex=False)
full_df.columns = full_df.columns.str.replace('.', '', regex=False)
full_df.columns = full_df.columns.str.replace('&', '_', regex=False)

2.2.1 Data preparation

By Sankaranarayanan Balasubramanian and Chun-Li

The following data preparation steps helped us prepare our data for implementing various modeling / validation techniques:

  1. Since we need to predict whether an officer is disciplined, we derived new predictors from existing ones (for example, binned salary) that intuitively seem helpful for predicting the response.

  2. We shuffled the dataset to prepare it for K-fold cross validation.

  3. We created a standardized version of the dataset, as we will use it to develop Lasso / Ridge regression models.

#Inference 4 prep
getsust = pd.read_csv('officers_only.csv')
cln = pd.read_csv('cleaned_data.csv')

full = pd.concat([cln, getsust[['sustained']]], axis=1)

sust_only = full.loc[full['sustained'] == 1]
sust_only = sust_only.drop(columns = {'sustained'})
sust_only = sust_only.drop(columns = {'Unnamed: 0'})
######---------------Creating new predictors----------------#########
# Jitters points to make distribution more visible 
def jitter(values,j):
    return values + np.random.normal(j,0.01,values.shape)

# Plot response variable vs salary and officer age, to see whether binning is needed
plt.figure(figsize=(14,6))
plt.subplot(1, 2, 1)

ax = sns.scatterplot(x=df['salary'], y=jitter(df['disciplined'],0), s=15, edgecolors='white', linewidths=0.2)
plt.xlabel('Salary')
plt.ylabel('Disciplined')
ax.xaxis.set_major_formatter('${x:,.0f}')

plt.subplot(1, 2, 2)

ax = sns.scatterplot(x=full_df['officer_age'], y=jitter(full_df['y'],0), s=15, edgecolors='white', linewidths=0.2)
plt.xlabel('Officer Age')
plt.ylabel('Disciplined')

Given the non-uniform distribution of salaries between the two response values, binning salary may be appropriate.

# Bin salary data

def var_transform(data):
    # Quartile-bin salary; retbins=True also returns the bin edges
    binned_salary = pd.qcut(data['salary'], 4, retbins=True)
    bins = binned_salary[1]
    # include_lowest=True keeps the minimum salary from falling outside the first bin
    data['salary_binned'] = pd.cut(data['salary'], bins=bins, include_lowest=True)
    dum = pd.get_dummies(data.salary_binned, drop_first=True)
    dum.columns = ['salary'+str(x) for x in range(1, len(bins)-1)]
    data = pd.concat([data, dum], axis=1)
    return data

salary_bin_data=var_transform(full_df)
# Check whether probability of response changes non-monotonically across bins and whether bin number is appropriate
group_salary = salary_bin_data.groupby('salary_binned')['y'].agg([('p_disciplined','mean'),('nobs','count')]).reset_index(drop=False)

plt.plot(group_salary['salary_binned'].astype(str), group_salary['p_disciplined'])
plt.xlabel('Salary $')
plt.ylabel('P(disciplined)')
plt.show()

Binning or quadratically transforming salary appears appropriate.
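The quadratic alternative mentioned above is not pursued below; as a rough sketch on synthetic data (the column names and simulated response here are illustrative, not the project's actual variables), it would amount to adding a squared salary term to the logit formula:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for full_df: a salary column and a binary response
rng = np.random.default_rng(7)
demo = pd.DataFrame({"salary": rng.normal(80_000, 15_000, 500)})

# Standardize salary so the squared term stays numerically well-behaved
z = (demo["salary"] - demo["salary"].mean()) / demo["salary"].std()
demo["salary_std"] = z

# Simulate a response whose log-odds are quadratic in standardized salary
p = 1 / (1 + np.exp(-(-1.0 + 0.5 * z**2)))
demo["y"] = rng.binomial(1, p)

# The quadratic transform: include both the linear and squared terms
quad_mod = smf.logit("y ~ salary_std + I(salary_std**2)", data=demo).fit(disp=False)
print(quad_mod.params)
```

Compared with binning, the quadratic term uses fewer degrees of freedom but assumes a smooth, symmetric relationship between salary and the log-odds of discipline.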

# Turn salary bins into dummy variables and concat to main dataframe
salary_D = pd.get_dummies(full_df['salary_binned'])
salary_D.columns = ['salary_1', 'salary_2', 'salary_3', 'salary_4']
# concat to data frame, drop regular salary and salary binned
full_df = pd.concat([full_df, salary_D], axis=1)
full_df = full_df.drop(columns=['salary_binned', 'salary'])
######-----------Shuffling the dataset for K-fold------------#########
######-----Standardizing the dataset for Lasso / Ridge-------#########
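The two banner comments above were left as placeholders; a minimal sketch of both steps on a synthetic stand-in dataframe (the column names are illustrative) could be:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for full_df: binary response 'y' plus two predictors
rng = np.random.default_rng(42)
full_demo = pd.DataFrame({
    "salary": rng.normal(80_000, 15_000, 200),
    "officer_age": rng.integers(21, 65, 200).astype(float),
    "y": rng.integers(0, 2, 200),
})

# Shuffle rows so K-fold splits are not ordered by how the CSVs were merged
shuffled = full_demo.sample(frac=1, random_state=1).reset_index(drop=True)

# Standardize the predictors (not the response) for Lasso / Ridge, whose
# penalties are sensitive to predictor scale
X = shuffled.drop(columns="y")
X_std = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
```

Standardizing after shuffling (and, in a full pipeline, fitting the scaler on the training folds only) avoids leaking scale information across cross-validation splits.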

2.3 Exploratory data analysis


2.3.1 Inference 1: What variables are most associated with whether an officer is disciplined or not?

Strategy: start with the full model -> perform backward selection, iteratively removing the predictor with the largest p-value until no predictor has a p-value greater than 0.05 -> check for multicollinearity with VIF.

# All predictors 
Xs = list(full_df.columns.difference(['y']))
full_model = smf.logit('y~{}'.format('+'.join(Xs)), data = full_df).fit(maxiter=500, disp=False)
print('Percentage of 96 predictors with p-values greater than 0.05:',(full_model.pvalues[1:] > 0.05).sum() / len(Xs)*100,'%')
Percentage of 96 predictors with p-values greater than 0.05: 84.375 %
# Perform backward selection, removing p-values

# Start with full model 
logit_model = full_model
Xs = list(full_df.columns.difference(['y']))

while True:
    # Exclude the intercept so it is never selected for removal
    pvals = logit_model.pvalues.drop('Intercept')
    if pvals.max() > 0.05:
        # remove the predictor with the largest p-value and refit
        predictor_to_remove = pvals.idxmax()
        Xs.remove(predictor_to_remove)
        logit_model = smf.logit('y~{}'.format('+'.join(Xs)), data=full_df).fit(disp=False)
    else:
        break
        
final_back_selected_mod = logit_model
print(final_back_selected_mod.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:               194142
Model:                          Logit   Df Residuals:                   194098
Method:                           MLE   Df Model:                           43
Date:                Tue, 14 Mar 2023   Pseudo R-squ.:                  0.1168
Time:                        14:15:38   Log-Likelihood:                -31133.
converged:                       True   LL-Null:                       -35251.
Covariance Type:            nonrobust   LLR p-value:                     0.000
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                         -3.7544      0.134    -28.119      0.000      -4.016      -3.493
ASIA_PAC_NATIV_AM                 -0.4838      0.091     -5.299      0.000      -0.663      -0.305
BUREAU_OF_INTERNAL_AFFAIRS        -0.6114      0.170     -3.597      0.000      -0.945      -0.278
Bribery___Official_Corruption      1.9319      0.175     11.067      0.000       1.590       2.274
CENTRAL_DETENTION_UNIT             0.5467      0.109      5.024      0.000       0.333       0.760
CHIEF                             -1.1168      0.308     -3.628      0.000      -1.720      -0.514
COMMANDER                         -0.8792      0.210     -4.180      0.000      -1.291      -0.467
Conduct_Unbecoming__Off_duty       1.9287      0.106     18.227      0.000       1.721       2.136
Criminal_Misconduct                1.1719      0.111     10.567      0.000       0.955       1.389
DETECTIVE_AREA___CENTRAL          -0.1535      0.073     -2.102      0.036      -0.297      -0.010
DETECTIVE_AREA___SOUTH            -0.2004      0.071     -2.823      0.005      -0.340      -0.061
DETECTIVE_SECTION___AREA_4        -0.7295      0.286     -2.553      0.011      -1.290      -0.169
DISTRICT_001                       0.1584      0.055      2.868      0.004       0.050       0.267
DISTRICT_002                       0.2082      0.049      4.207      0.000       0.111       0.305
DISTRICT_003                      -0.1698      0.065     -2.623      0.009      -0.297      -0.043
DISTRICT_010                       0.1707      0.073      2.337      0.019       0.028       0.314
DISTRICT_013                       0.3508      0.149      2.352      0.019       0.058       0.643
DISTRICT_014                       0.2808      0.075      3.756      0.000       0.134       0.427
DISTRICT_018                       0.1508      0.056      2.676      0.007       0.040       0.261
DISTRICT_019                       0.2526      0.062      4.086      0.000       0.131       0.374
DISTRICT_025                       0.2182      0.075      2.927      0.003       0.072       0.364
DISTRICT_REINSTATEMENT_UNIT        0.7166      0.150      4.767      0.000       0.422       1.011
Domestic                           1.3951      0.103     13.526      0.000       1.193       1.597
Drug___Alcohol_Abuse               3.3041      0.123     26.953      0.000       3.064       3.544
False_Arrest                      -1.5978      0.281     -5.695      0.000      -2.148      -1.048
GANG_ENFORCEMENT_INVESTIGATION    -0.4031      0.134     -3.007      0.003      -0.666      -0.140
HISPANIC                          -0.4504      0.037    -12.092      0.000      -0.523      -0.377
Illegal_Search                    -1.6488      0.150    -11.011      0.000      -1.942      -1.355
Lockup_Procedures                  1.5800      0.093     17.049      0.000       1.398       1.762
MALE                              -0.1351      0.029     -4.675      0.000      -0.192      -0.078
MOBILE_STRIKE_FORCE               -0.5607      0.242     -2.319      0.020      -1.035      -0.087
OEMC___DETAIL_SECTION              0.7195      0.165      4.371      0.000       0.397       1.042
ON                                -0.7329      0.055    -13.256      0.000      -0.841      -0.625
Operation_Personnel_Violations     1.6856      0.085     19.825      0.000       1.519       1.852
PUBLIC_TRANSPORTATION_SECTION      0.1737      0.066      2.639      0.008       0.045       0.303
RECRUIT_TRAINING_SECTION           2.6503      0.229     11.587      0.000       2.202       3.099
Supervisory_Responsibilities       0.9505      0.131      7.264      0.000       0.694       1.207
Traffic                            1.2952      0.102     12.714      0.000       1.095       1.495
Use_of_Force                       0.4905      0.092      5.338      0.000       0.310       0.671
WHITE                             -0.6969      0.027    -26.040      0.000      -0.749      -0.644
officer_age                        0.0080      0.002      4.564      0.000       0.005       0.011
salary_2                           0.2037      0.039      5.286      0.000       0.128       0.279
salary_3                           0.3069      0.042      7.342      0.000       0.225       0.389
salary_4                           0.3097      0.049      6.342      0.000       0.214       0.405
==================================================================================================
final_back_selected_mod.params.sort_values().tail(5)
Operation_Personnel_Violations    1.685647
Conduct_Unbecoming__Off_duty      1.928682
Bribery___Official_Corruption     1.931911
RECRUIT_TRAINING_SECTION          2.650347
Drug___Alcohol_Abuse              3.304148
dtype: float64
# Effect sizes on the odds-ratio scale for a few notable coefficients
# False_Arrest: roughly an 80% reduction in the odds of being disciplined
1 - np.exp(-1.597846)
0.7976681281978599
# Illegal_Search: roughly an 81% reduction in the odds of being disciplined
1 - np.exp(-1.648782)
0.807716032077367
# Operation_Personnel_Violations: roughly a 5.4x increase in the odds
np.exp(1.685647)
5.395940977010438
full_df
ADMIN_SCHOOL_SECURIT AIRPORT_RELATED BOMB_AND_ARSON_DIVISION BUREAU_OF_INTERNAL_AFFAIRS BUREAU_OF_ORGANIZED_CRIME CENTRAL_DETENTION_UNIT CENTRAL_INVESTIGATIONS_DIVISION DET_DIV_ADMIN DETACHED_SERVICES___GOVERMENT_SECURITY DETACHED_SERVICES___MISCELLANEOUS_DETAIL ... COMMANDER Other_rank PO_AS_DETECTIVE POLICE_OFFICER officer_age y salary_1 salary_2 salary_3 salary_4
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 69.0 1 0 0 0 1
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 69.0 0 0 0 0 1
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 69.0 0 0 0 0 1
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 69.0 0 0 0 0 1
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 69.0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2519371 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 49.0 1 1 0 0 0
2519373 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 31.0 1 1 0 0 0
2519375 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 29.0 1 1 0 0 0
2519380 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 32.0 1 1 0 0 0
2519385 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0 0 0 1 40.0 1 1 0 0 0

194142 rows × 97 columns

len(final_back_selected_mod.params)
44

The back selected model removed 53 predictors from the full model. The next thing to do is to add an additional effect size filter. We are only interested in the predictors that lead to at least a 5% change in the odds ratio of being disciplined.

# Get the back-selected model's coefficients and predictor names (dropping the intercept)
p_back_selected_coefs = list(final_back_selected_mod.params.drop('Intercept'))
p_back_selected_params = list(final_back_selected_mod.params.drop('Intercept').index)
odds_ratios = np.exp(p_back_selected_coefs)
effect_sizes = abs(1 - odds_ratios)
# Find whether any significant predictors were really weak (changed the odds ratio by less than 5%)
weak_ind = np.where(effect_sizes < 0.05)
print('Weak predictor to remove:', p_back_selected_params[weak_ind[0][0]])
Weak predictor to remove: officer_age

Finally, let’s check that there aren’t severe multicollinearity problems. By performing p-value back selection, we already likely removed many of the most multicollinear predictors, since predictors with high multicollinearity have large estimated standard errors and therefore large p-values. In fact, this was part of the motivation for back selecting in this manner.

def find_VIF(X): # takes matrix of predictors, calculates VIF for each
    
    X = sm.add_constant(X)
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns

    for i in range(len(X.columns)):
        vif_data.loc[i,'VIF'] = variance_inflation_factor(X.values, i)

    return vif_data.sort_values(by='VIF', ascending=False)
df['current_rank'].value_counts()
POLICE OFFICER     169490
PO AS DETECTIVE     21989
COMMANDER            1241
CHIEF                 761
Other                 648
UNKNOWN                13
Name: current_rank, dtype: int64
# Remove officer_age (left commented out so it still appears in the VIF check below)
#p_back_selected_params.remove('officer_age')
# Calculate VIF
vif = find_VIF(full_df[p_back_selected_params])
find_VIF(full_df.drop('y', axis=1))
feature VIF
0 const 22062.261156
91 POLICE_OFFICER 2139.120552
90 PO_AS_DETECTIVE 1944.288618
85 WHITE 841.181123
83 BLACK 671.560549
... ... ...
9 DETACHED_SERVICES___GOVERMENT_SECURITY 4.164397
93 salary_1 3.454936
92 officer_age 2.440342
86 ON 1.826083
81 MALE 1.051903

97 rows × 2 columns

plt.bar(vif.iloc[1:,0], vif.iloc[1:,1])
plt.xticks([], [])
plt.xlabel('Predictors')
plt.ylabel('VIF')
plt.show()

print('The 3 predictors with the highest VIF are:', vif.iloc[1,0],',',vif.iloc[2,0],',',vif.iloc[3,0])
The 3 predictors with the highest VIF are: Operation_Personnel_Violations , Use_of_Force , Illegal_Search

The predictors only have moderate multicollinearity, indicating that the back selected model is stable and, crucially, that the p-values of the predictors come from precise estimates of their standard errors.

Inference 3 - Sustained or not sustained

df = pd.read_csv('officers_only.csv', low_memory=False)

yes = df[df['sustained'] ==1]
no = df[df['sustained'] == 0].sample(yes.shape[0])
df = pd.concat([no, yes])
sns.pairplot(data=df)
<seaborn.axisgrid.PairGrid at 0x7fabd0c74a90>

df['salary_binned'] = pd.qcut(df['salary'], q=5)
sns.barplot(data=df, y='sustained', x='salary_binned')
<AxesSubplot:xlabel='salary_binned', ylabel='sustained'>

ax = sns.barplot(data=df, y='sustained', x='complaint_category')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize=12, horizontalalignment='right')
plt.show()

ax = sns.barplot(data=df, y='sustained', x='officer_race')
ax.set_xticklabels(ax.get_xticklabels(), rotation=10)
plt.show()

ax = sns.barplot(data=df, y='sustained', x='on_off_duty')

df[df['on_off_duty'] == 'OFF']['complaint_category'].value_counts()
Conduct Unbecoming (Off-duty)    1854
Use of Force                     1802
Drug / Alcohol Abuse              229
Domestic                            1
Name: complaint_category, dtype: int64
df['complaint_category'].value_counts()
Operation/Personnel Violations    12415
Use of Force                       5304
Illegal Search                     2658
Lockup Procedures                  2097
Conduct Unbecoming (Off-duty)      1854
Verbal Abuse                       1094
Traffic                            1071
Domestic                            825
Criminal Misconduct                 749
False Arrest                        557
Supervisory Responsibilities        498
Drug / Alcohol Abuse                347
Bribery / Official Corruption       111
Other                                14
Name: complaint_category, dtype: int64
logit_model = smf.logit(formula = "sustained~complaint_category+salary_binned+unit_description+officer_race+on_off_duty+officer_age+I(officer_age**2)", data = df).fit()

This is not the final prediction model, but the formula used to determine the most important variables associated with whether a complaint was sustained or not. Complaint category was easily the most significant. As for transformations, a quadratic transformation of officer age was appropriate based on the shape of the graph, and binned salary was another variable that affected the outcome.
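As a sanity check on the quadratic age term, here is a sketch on simulated data (not our dataset; the coefficients below are invented) where the probability of a sustained finding peaks in mid-career. When the true relationship is curved like this, the curvature shows up in the sign and significance of the squared term:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20000

# Simulated ages and a sustain probability that peaks around age 43
age = rng.uniform(21, 65, size=n)
logit_p = -0.003 * (age - 43) ** 2 - 1.0
p = 1 / (1 + np.exp(-logit_p))
sim = pd.DataFrame({'sustained': rng.binomial(1, p), 'officer_age': age})

# Same quadratic specification as in the model above
mod = smf.logit('sustained ~ officer_age + I(officer_age**2)', data=sim).fit(disp=False)
print(mod.params)
```

Here the fitted squared-term coefficient comes out negative with a positive linear term, the signature of an inverted-U relationship between age and the response.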

Inference 4

getsust = pd.read_csv('officers_only.csv')
df = pd.read_csv('cleaned_data.csv')

full_df = pd.concat([df, getsust[['sustained']]], axis=1)

sust_only = full_df.loc[full_df['sustained'] == 1]
sust_only = sust_only.drop(columns = {'sustained'})
sust_only = sust_only.drop(columns = {'Unnamed: 0'})
/var/folders/4b/sl3y8zds5g52xl7kl2dsn52w0000gn/T/ipykernel_20644/75206399.py:1: DtypeWarning: Columns (24) have mixed types. Specify dtype option on import or set low_memory=False.
  getsust = pd.read_csv('officers_only.csv')
#Dataframe showing the correlation of each column with the target variable
y_df = pd.DataFrame(sust_only.corr()["y"]).sort_values(by = 'y', ascending = False)


# Lasso Regression
q = list(sust_only.columns)
q.remove('y')
G = sust_only[q]

alphas = 10**np.linspace(0,-4,200)*0.5

#Define scaler
scaler = StandardScaler()
scaler.fit(G)
Gstd = scaler.transform(G)
targ = sust_only.y

lassocv = LassoCV(alphas = alphas, cv = 10, max_iter = 100000)
lassocv.fit(Gstd, targ)
lassocv.alpha_
0.0029363933065947416
lasso = Lasso(alpha = lassocv.alpha_)
lasso.fit(Gstd, targ)

# create a dataframe with the coefficients from the lasso regression
coefdf = pd.concat([pd.Series(G.columns), pd.Series(lasso.coef_)],axis=1).sort_values(by = 1, ascending = True)
def vif(X):
    X = add_constant(X)
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    
    for i in range(len(X.columns)):
        vif_data.loc[i, 'VIF'] = variance_inflation_factor(X.values, i)
        
    return(vif_data)
# use VIF to determine and remove multicollinear predictors
l = coefdf.iloc[:, 0].tolist()
lass = sust_only[l]

# drop predictors with VIF over 5 (keep only those at or below the threshold)
viflassdf = vif(lass)
new = viflassdf[viflassdf['VIF'] <= 5]

3 Developing the prediction model

3.0.1 Code fitting the prediction model (Alan Senkus)

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import precision_score, recall_score
import time
import warnings

The commented-out line(s) were used to fit the model to a subset of the data: the sustained allegations only.

data_cleaned = pd.read_csv('cleaned_data_w_uids.csv', low_memory=False, index_col=[0])
data_officers_fulldf = pd.read_csv('officers_only_fulldf.csv', low_memory=False, index_col=[0])

data_cleaned = data_cleaned.rename(columns={'y': 'disciplined'})

# data_cleaned = data_cleaned[data_cleaned['sustained'] == 1]

data_cleaned.index = data_officers_fulldf.index

# merge, but if the columns are the same, keep the ones from the clean data
merged = data_officers_fulldf.merge(data_cleaned, how='outer', left_index=True, right_index=True, validate='1:1', suffixes=('_off', ''))
merged = merged.loc[:, ~merged.columns.str.endswith('_off')]

merged.drop(['UID'], axis=1, inplace=True)

# merged = merged[merged['sustained'] == 1]

data_cleaned = merged

Some early leftover EDA.

# data_cleaned.corr()['sustained'].sort_values(ascending=False)

# Sustained shouldn't be a predictor. We want to look at all allegations.
data_sustained = data_cleaned['sustained']
data_cleaned.drop('sustained', axis=1, inplace=True)

# What percentage of complaints are disciplined?
data_cleaned['disciplined'].value_counts(normalize=True)
0    0.955636
1    0.044364
Name: disciplined, dtype: float64

Noise code credit to Alex

# Add some noise to salary
data_cleaned['salary'] = (data_cleaned['salary'] + np.random.normal(0,5,data_cleaned.shape[0]))
# Add some noise to age
data_cleaned['officer_age'] = data_cleaned['officer_age'] + np.random.normal(0,1.5,data_cleaned.shape[0])

This was leftover from quick prototyping as VIF analysis took a long time.

# Only include columns with correlation >0.005 with disciplined (quick and dirty, but there are a LOT of columns)
# data_cleaned = data_cleaned[data_cleaned.columns[data_cleaned.corr()['disciplined'].abs() > 0.005]]
# Split the data into features (X) and target (y)
X = data_cleaned.drop('disciplined', axis=1)
y = data_cleaned['disciplined']

# add intercept
# X = sm.add_constant(X)

# Remove all perfect correlations with other variables
# X = X.loc[:,~X.columns.duplicated()]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# combine X_train_resampled and y_train_resampled into a dataframe
# X_train_resampled = pd.DataFrame(X_train_resampled, columns=X_train.columns)

We eliminate columns that don’t change at all.

# Drop singular matrix columns
sing_matrix = []
for col in X_train.columns:
    if X_train[col].nunique() == 1:
        sing_matrix.append(col)

X_train.drop(sing_matrix, axis=1, inplace=True)
X_test.drop(sing_matrix, axis=1, inplace=True)

As outlined in the Final Report, we chose a lenient VIF threshold (40 in the code below, rather than the conventional 5 or 10) to give it a little more wiggle room. Our data is very sparse and from the real world, so we wanted to give ourselves a chance.

# Calculate VIF for each feature
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]

# Remove features from X_train and X_test with VIF > 40
vif['features'] = X_train.columns
vif = vif[vif['VIF Factor'] > 40]

X_train.drop(vif['features'], axis=1, inplace=True)
X_test.drop(vif['features'], axis=1, inplace=True)

X_test = sm.add_constant(X_test)
/opt/anaconda3/lib/python3.9/site-packages/statsmodels/stats/outliers_influence.py:195: RuntimeWarning: divide by zero encountered in double_scalars
  vif = 1. / (1. - r_squared_i)

SMOTE was a suggestion by Prof Krish after our presentation – Thank you! We used it to balance the appearance of the two classes in our data. We experimented with different ratios, but did not see a significant difference in the results.

# Apply SMOTE to the training set
smote = SMOTE(sampling_strategy=1, random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

Forward selection code, adapted from the class notes. Here, we minimize the AIC.

Part of the reason we chose AIC was because of this: “The Bayesian Information Criterion (BIC) is more useful in selecting a correct model while the AIC is more appropriate in finding the best model for predicting future observations.” (https://www.sciencedirect.com/science/article/pii/B9780444518620500186)

#Function to develop a model based on all predictors in predictor_subset
def processSubset(predictor_subset):
    # Fit a logit model on predictor_subset and score it by AIC
    # (always include an intercept)
    with_intercept = pd.DataFrame(sm.add_constant(X_train_resampled[predictor_subset]))

    model = sm.Logit(y_train_resampled, with_intercept).fit(disp=False)
    score = model.aic
    return {"model":model, "score":score}

#Function to find the best predictor out of p-k predictors and add it to the model containing the k predictors
def forward(pred):

    # Pull out predictors we still need to process
    remaining_predictors = [p for p in X_train_resampled.columns if p not in pred]
    
    tic = time.time()
    
    results = []
    
    for p in remaining_predictors:
        results.append(processSubset(pred+[p]))
    
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    
    # Choose the model with the lowest AIC
    best_model = models.loc[models['score'].argmin()]
    
    toc = time.time()
    print("Processed ", models.shape[0], "models on", len(pred)+1, "predictors in", (toc-tic), "seconds.")
    
    # Return the best model, along with some other useful information about the model
    return best_model

def forward_selection():
    models_best = pd.DataFrame(columns=["score", "model"])

    tic = time.time()
    predictors = []

    for i in range(1,len(X_train_resampled.columns)+1):
        # print("Progress:", i, "/", len(X_train_resampled.columns))
        models_best.loc[i] = forward(predictors)
        predictors = list(models_best.loc[i]["model"].params.index[1:])

    toc = time.time()
    print("Total elapsed time:", (toc-tic), "seconds.")
    return models_best

models_best = forward_selection()
Processed  30 models on 1 predictors in 1.5637402534484863 seconds.
Processed  29 models on 2 predictors in 2.0958340167999268 seconds.
Processed  28 models on 3 predictors in 2.447935104370117 seconds.
Processed  27 models on 4 predictors in 2.4803669452667236 seconds.
Processed  26 models on 5 predictors in 2.8363490104675293 seconds.
Processed  25 models on 6 predictors in 3.134242057800293 seconds.
Processed  24 models on 7 predictors in 3.307857036590576 seconds.
Processed  23 models on 8 predictors in 3.3351292610168457 seconds.
Processed  22 models on 9 predictors in 3.304680824279785 seconds.
Processed  21 models on 10 predictors in 3.379564046859741 seconds.
Processed  20 models on 11 predictors in 3.352635145187378 seconds.
Processed  19 models on 12 predictors in 4.098873853683472 seconds.
Processed  18 models on 13 predictors in 3.4987878799438477 seconds.
Processed  17 models on 14 predictors in 3.57153582572937 seconds.
Processed  16 models on 15 predictors in 3.7793631553649902 seconds.
Processed  15 models on 16 predictors in 3.4337270259857178 seconds.
Processed  14 models on 17 predictors in 3.1700479984283447 seconds.
Processed  13 models on 18 predictors in 3.2971270084381104 seconds.
Processed  12 models on 19 predictors in 2.9969730377197266 seconds.
Processed  11 models on 20 predictors in 3.052708148956299 seconds.
Processed  10 models on 21 predictors in 3.0324201583862305 seconds.
Processed  9 models on 22 predictors in 2.602311134338379 seconds.
Processed  8 models on 23 predictors in 2.3304378986358643 seconds.
Processed  7 models on 24 predictors in 2.365370750427246 seconds.
Processed  6 models on 25 predictors in 2.0494720935821533 seconds.
Processed  5 models on 26 predictors in 1.7375271320343018 seconds.
Processed  4 models on 27 predictors in 1.4599499702453613 seconds.
Processed  3 models on 28 predictors in 1.0515141487121582 seconds.
Processed  2 models on 29 predictors in 0.8562498092651367 seconds.
Processed  1 models on 30 predictors in 0.5110898017883301 seconds.
Total elapsed time: 80.5423309803009 seconds.
models_best['score'] = models_best['score'].astype(float)
models_best.sort_values(by='score', ascending=True, inplace=True)
best_model = models_best.iloc[0]['model']

best_model_params = best_model.params.index.tolist()
# X_train = X_train[best_model_params]
X_test_best = X_test[best_model_params]

best_model.summary()
Logit Regression Results
Dep. Variable: disciplined No. Observations: 296786
Model: Logit Df Residuals: 296758
Method: MLE Df Model: 27
Date: Wed, 15 Mar 2023 Pseudo R-squ.: 0.03464
Time: 23:39:04 Log-Likelihood: -1.9859e+05
converged: True LL-Null: -2.0572e+05
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
const 1.0667 0.014 78.533 0.000 1.040 1.093
ON -0.8556 0.011 -76.775 0.000 -0.877 -0.834
Drug___Alcohol_Abuse 2.0675 0.056 36.742 0.000 1.957 2.178
MALE -0.4324 0.010 -42.859 0.000 -0.452 -0.413
RECRUIT_TRAINING_SECTION 2.3656 0.152 15.588 0.000 2.068 2.663
DISTRICT_REINSTATEMENT_UNIT 1.2144 0.067 18.204 0.000 1.084 1.345
OEMC___DETAIL_SECTION 1.4279 0.084 17.079 0.000 1.264 1.592
DETECTIVE_SECTION___AREA_4 -1.5707 0.110 -14.258 0.000 -1.787 -1.355
Bribery___Official_Corruption 0.9190 0.061 15.154 0.000 0.800 1.038
CENTRAL_INVESTIGATIONS_DIVISION -1.1213 0.095 -11.755 0.000 -1.308 -0.934
Supervisory_Responsibilities -0.3038 0.033 -9.302 0.000 -0.368 -0.240
MOUNTED_UNIT -0.8688 0.101 -8.572 0.000 -1.067 -0.670
BUREAU_OF_ORGANIZED_CRIME -2.2350 0.365 -6.123 0.000 -2.950 -1.520
OFFICE_OF_THE_FIRST_DEPUTY_SUPERINTENDENT -0.7498 0.115 -6.518 0.000 -0.975 -0.524
MARINE_OPERATIONS_UNIT -0.4290 0.073 -5.917 0.000 -0.571 -0.287
HUMAN_RESOURCES_DIVISION -0.5299 0.103 -5.159 0.000 -0.731 -0.329
DETACHED_SERVICES___MISCELLANEOUS_DETAIL 0.5813 0.117 4.976 0.000 0.352 0.810
BOMB_AND_ARSON_DIVISION 0.2911 0.074 3.937 0.000 0.146 0.436
MAJOR_ACCIDENT_INVESTIGATION_UNIT -0.3106 0.080 -3.860 0.000 -0.468 -0.153
PREVEN___NEIGH_DIV 0.5266 0.166 3.169 0.002 0.201 0.852
POLICE_DOCUMENTS_SECTION -0.6116 0.200 -3.051 0.002 -1.004 -0.219
SPECIAL_INVESTIGATIONS_UNIT -0.2175 0.083 -2.626 0.009 -0.380 -0.055
FIELD_SERVICES_SECTION -0.2754 0.108 -2.558 0.011 -0.486 -0.064
INTELLIGENCE_SECTION -0.6550 0.281 -2.335 0.020 -1.205 -0.105
INSPECTION_DIVISION 0.2306 0.115 2.007 0.045 0.005 0.456
DET_DIV_ADMIN -0.3208 0.183 -1.753 0.080 -0.679 0.038
ADMIN_SCHOOL_SECURIT -0.2834 0.189 -1.497 0.134 -0.654 0.088
PUBLIC_HOUSING -0.1175 0.079 -1.479 0.139 -0.273 0.038

We also compared it to the full logit model (all predictors available to forward selection) and to a base, intercept-only logit model.

# full logit model (all predictors available in forward selection)
full_logit = sm.Logit(y_train_resampled, sm.add_constant(X_train_resampled)).fit(disp=False)
full_logit.summary()
Logit Regression Results
Dep. Variable: disciplined No. Observations: 296786
Model: Logit Df Residuals: 296755
Method: MLE Df Model: 30
Date: Wed, 15 Mar 2023 Pseudo R-squ.: 0.03465
Time: 23:39:05 Log-Likelihood: -1.9859e+05
converged: True LL-Null: -2.0572e+05
Covariance Type: nonrobust LLR p-value: 0.000
coef std err z P>|z| [0.025 0.975]
const 1.0670 0.014 78.525 0.000 1.040 1.094
ADMIN_SCHOOL_SECURIT -0.2838 0.189 -1.499 0.134 -0.655 0.087
BOMB_AND_ARSON_DIVISION 0.3006 0.075 4.022 0.000 0.154 0.447
BUREAU_OF_ORGANIZED_CRIME -2.2355 0.365 -6.124 0.000 -2.951 -1.520
CENTRAL_INVESTIGATIONS_DIVISION -1.1217 0.095 -11.759 0.000 -1.309 -0.935
DET_DIV_ADMIN -0.3213 0.183 -1.756 0.079 -0.680 0.037
DETACHED_SERVICES___GOVERMENT_SECURITY 0.0151 0.232 0.065 0.948 -0.440 0.470
DETACHED_SERVICES___MISCELLANEOUS_DETAIL 0.5808 0.117 4.973 0.000 0.352 0.810
DETECTIVE_SECTION___AREA_4 -1.5711 0.110 -14.262 0.000 -1.787 -1.355
DISTRICT_REINSTATEMENT_UNIT 1.2140 0.067 18.198 0.000 1.083 1.345
FIELD_SERVICES_SECTION -0.2758 0.108 -2.562 0.010 -0.487 -0.065
HUMAN_RESOURCES_DIVISION -0.5298 0.103 -5.157 0.000 -0.731 -0.328
INSPECTION_DIVISION 0.2301 0.115 2.003 0.045 0.005 0.455
INTELLIGENCE_SECTION -0.6555 0.281 -2.336 0.019 -1.205 -0.106
MAJOR_ACCIDENT_INVESTIGATION_UNIT -0.3110 0.080 -3.865 0.000 -0.469 -0.153
MARINE_OPERATIONS_UNIT -0.4294 0.073 -5.922 0.000 -0.572 -0.287
MOUNTED_UNIT -0.8692 0.101 -8.577 0.000 -1.068 -0.671
OEMC___DETAIL_SECTION 1.4275 0.084 17.074 0.000 1.264 1.591
OFFICE_OF_THE_FIRST_DEPUTY_SUPERINTENDENT -0.7311 0.117 -6.248 0.000 -0.960 -0.502
PATROL -0.0938 0.071 -1.326 0.185 -0.233 0.045
POLICE_DOCUMENTS_SECTION -0.6121 0.200 -3.054 0.002 -1.005 -0.219
PREVEN___NEIGH_DIV 0.5261 0.166 3.166 0.002 0.200 0.852
PUBLIC_HOUSING -0.1179 0.079 -1.485 0.138 -0.274 0.038
RECRUIT_TRAINING_SECTION 2.3652 0.152 15.585 0.000 2.068 2.663
SPECIAL_INVESTIGATIONS_UNIT -0.2180 0.083 -2.631 0.009 -0.380 -0.056
Bribery___Official_Corruption 0.9187 0.061 15.149 0.000 0.800 1.038
Drug___Alcohol_Abuse 2.0672 0.056 36.736 0.000 1.957 2.178
Supervisory_Responsibilities -0.3025 0.033 -9.258 0.000 -0.367 -0.238
MALE -0.4323 0.010 -42.831 0.000 -0.452 -0.413
ON -0.8555 0.011 -76.749 0.000 -0.877 -0.834
Other_rank -0.0637 0.071 -0.897 0.370 -0.203 0.075
# base logit model (just the intercept)
base_logit = sm.Logit(y_train_resampled, sm.add_constant(pd.DataFrame(np.ones(y_train_resampled.shape[0])))).fit(disp=False)
base_logit.summary()
Logit Regression Results
Dep. Variable: disciplined No. Observations: 296786
Model: Logit Df Residuals: 296785
Method: MLE Df Model: 0
Date: Wed, 15 Mar 2023 Pseudo R-squ.: 0.000
Time: 23:39:05 Log-Likelihood: -2.0572e+05
converged: True LL-Null: -2.0572e+05
Covariance Type: nonrobust LLR p-value: nan
coef std err z P>|z| [0.025 0.975]
0 0 0.004 0 1.000 -0.007 0.007

The confusion-matrix plotting code is adapted from the class notes.

threshold = 0.5

def plot_confusion_matrix(y_test, y_pred, name):
    # Confusion matrix plot (percentages) multiplot of each model

    plt.figure(figsize=(30, 10))
    plt.suptitle(name, fontsize=16, fontweight='bold')

    plt.subplot(1, 3, 1)

    cm = confusion_matrix(y_test, y_pred)
    # cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    sns.heatmap(cm, annot=True, cmap='Blues')
    plt.title('based on total observations (Num)')
    plt.xlabel('Predicted (Num)')
    plt.ylabel('Actual (Num)')

    # plt.show()

    plt.subplot(1, 3, 2)

    # cm = confusion_matrix(y_test, y_pred)
    # cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    # sum across all elements and divide by total number of elements
    cm2 = cm.astype('float') / cm.sum()
    sns.heatmap(cm2, annot=True, cmap='Reds')
    plt.title('Based on proportion of total observations')
    plt.xlabel('Predicted %')
    plt.ylabel('Actual %')

    plt.subplot(1, 3, 3)

    # cm = confusion_matrix(y_test, y_pred)
    cm3 = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    # normalize each row by its actual class count
    sns.heatmap(cm3, annot=True, cmap='Greens')
    plt.title('Based on proportion of actual (row) observations')
    plt.xlabel('Predicted %')
    plt.ylabel('Actual %')

    # plt.show()

    # display precision and recall in text
    # plt.subplot(2, 3, 4)

    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)

    plt.text(0.1, -0.1, 'Precision: ' + str(prec), fontsize=20)
    plt.text(0.1, -0.2, 'Recall: ' + str(rec), fontsize=20)

    print('Precision of ' + name + ': ', precision_score(y_test, y_pred))
    print('Recall of ' + name + ': ', recall_score(y_test, y_pred))

def test_and_graph_model(model, x_test, y_test, name="Model"):
    # Test the model
    y_pred = model.predict(x_test)

    # Check the accuracy of the model
    y_pred = np.where(y_pred > threshold, 1, 0)
    print('Accuracy of the model: ', (y_pred == y_test).mean())

    plot_confusion_matrix(y_test, y_pred, name)

Here, we choose a minimum acceptable precision (0.05) and find the best recall we can get with that precision.

This strategy was adapted from: https://towardsdatascience.com/calculating-and-setting-thresholds-to-optimise-logistic-regression-performance-c77e6d112d7e

Although we tried them, the F1 score and TPR-NPR did not work well for our model or our requirements. Being explicit and maximizing recall subject to a minimum precision was the strategy we ultimately chose.

import numpy as np
from sklearn.metrics import precision_recall_curve

X_train['const'] = 1

def find_best_threshold_max_recall(model, X_test, y_test, min_precision=0.5):
    y_pred_probs = model.predict(X_test)
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_probs)
    
    # Find indices where precision is greater than or equal to the minimum acceptable precision
    acceptable_precision_indices = np.where(precision[:-1] >= min_precision)
    
    # Select the corresponding recall and threshold values
    acceptable_recall = recall[acceptable_precision_indices]
    acceptable_thresholds = thresholds[acceptable_precision_indices]
    
    # Find the index of the maximum recall value
    max_recall_index = np.argmax(acceptable_recall)
    
    # Find the best threshold value
    best_threshold = acceptable_thresholds[max_recall_index]

    return best_threshold

min_precision = 0.05 # This value can be adjusted.
best_threshold_max_recall = find_best_threshold_max_recall(best_model, X_test_best, y_test, min_precision)
print("Best threshold value for maximizing recall while maintaining a minimum precision of", min_precision, ":", best_threshold_max_recall)
Best threshold value for maximizing recall while maintaining a minimum precision of 0.05 : 0.44574496519007095
# regular models (no threshold adjustment)
threshold = 0.5

# print("### BEST Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(best_model, X_test_best, y_test, name="BEST Model, No Threshold Adjustment")

# print("### BASE Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(base_logit, X_test['const'], y_test, name="BASE Model, No Threshold Adjustment")

# print("### FULL Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(full_logit, X_test, y_test, name="FULL Model, No Threshold Adjustment")
Accuracy of the model:  0.7617502382240078
Precision of BEST Model, No Threshold Adjustment:  0.07926965041193498
Recall of BEST Model, No Threshold Adjustment:  0.4205552274069699
Accuracy of the model:  0.9563985680805583
Precision of BASE Model, No Threshold Adjustment:  0.0
Recall of BASE Model, No Threshold Adjustment:  0.0
Accuracy of the model:  0.7617502382240078
Precision of FULL Model, No Threshold Adjustment:  0.07926965041193498
Recall of FULL Model, No Threshold Adjustment:  0.4205552274069699
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
<Figure size 3000x1000 with 0 Axes>

<Figure size 3000x1000 with 0 Axes>

<Figure size 3000x1000 with 0 Axes>

# adjusted models (threshold adjustment)
threshold = best_threshold_max_recall

# print("### BEST Model, With Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(best_model, X_test_best, y_test, name="BEST Model, With Threshold Adjustment")

# print("### BASE Model, With Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(base_logit, X_test['const'], y_test, name="BASE Model, With  Threshold Adjustment")

# print("### FULL Model, With Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(full_logit, X_test, y_test, name="FULL Model, With Threshold Adjustment")
Accuracy of the model:  0.757887146205156
Precision of BEST Model, With Threshold Adjustment:  0.07907383136740935
Recall of BEST Model, With Threshold Adjustment:  0.42764323685764916
Accuracy of the model:  0.04360143191944166
Precision of BASE Model, With  Threshold Adjustment:  0.04360143191944166
Recall of BASE Model, With  Threshold Adjustment:  1.0
Accuracy of the model:  0.7576296067372325
Precision of FULL Model, With Threshold Adjustment:  0.07898756273183505
Recall of FULL Model, With Threshold Adjustment:  0.42764323685764916
<Figure size 3000x1000 with 0 Axes>

<Figure size 3000x1000 with 0 Axes>

<Figure size 3000x1000 with 0 Axes>

from sklearn.metrics import precision_recall_curve
ypred = best_model.predict(X_train[best_model_params])
p, r, thresholds = precision_recall_curve(y_train, ypred)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.figure(figsize=(8, 8))
    plt.title("Precision and Recall Scores as a function of the decision threshold")
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.ylabel("Score")
    plt.xlabel("Decision Threshold")
    plt.legend(loc='best')
    plt.legend()
plot_precision_recall_vs_threshold(p, r, thresholds)

3.0.2 This is how our model would perform on a test dataset with a 50/50 split of disciplined and not disciplined.

Just for reference, assuming a beautiful world where our model might be useful!

# y_test_half and X_test_half are the test set with 50/50 appearance of 0 and 1

y_test_half = y_test.copy()
X_test_half = X_test.copy()

combined_test = pd.concat([X_test, y_test], axis=1)

# zeros and ones are the test set split into two groups, 0 and 1
zeros = combined_test[combined_test[y_test.name] == 0]
ones = combined_test[combined_test[y_test.name] == 1]

min_count = min(len(zeros), len(ones))
sampled_zeros = zeros.sample(min_count)
sampled_ones = ones.sample(min_count)

balanced_test = pd.concat([sampled_zeros, sampled_ones], axis=0)

balanced_test = balanced_test.sample(frac=1).reset_index(drop=True)

X_test_half = balanced_test.drop(columns=[y_test.name])
y_test_half = balanced_test[y_test.name]
# regular models (no threshold adjustment)
threshold = 0.5

# print("### BEST Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(best_model, X_test_half[best_model_params], y_test_half, name="BEST Model, No Threshold Adjustment")

# print("### BASE Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(base_logit, X_test_half['const'], y_test_half, name="BASE Model, No Threshold Adjustment")

# print("### FULL Model, No Threshold Adjustment ###")
plt.figure(figsize=(30, 10))
test_and_graph_model(full_logit, X_test_half, y_test_half, name="FULL Model, No Threshold Adjustment")
Accuracy of the model:  0.600708800945068
Precision of BEST Model, No Threshold Adjustment:  0.6574330563250231
Recall of BEST Model, No Threshold Adjustment:  0.4205552274069699
Accuracy of the model:  0.5
Precision of BASE Model, No Threshold Adjustment:  0.0
Recall of BASE Model, No Threshold Adjustment:  0.0
Accuracy of the model:  0.600708800945068
Precision of FULL Model, No Threshold Adjustment:  0.6574330563250231
Recall of FULL Model, No Threshold Adjustment:  0.4205552274069699
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
<Figure size 3000x1000 with 0 Axes>

<Figure size 3000x1000 with 0 Axes>

<Figure size 3000x1000 with 0 Axes>

3.1 Conclusions and Recommendations to stakeholder(s)

At this time, our conclusions for our stakeholders are as follows: 1) more data is needed to develop a useful predictive model, and 2) human judgement is still needed to make the final decision on whether an officer should be disciplined or not, until GPT can learn to replace us in that too.

For more detailed information, see the final report.